Ctrl Alt Compete

Meet the Team:

Ramya Kurkal (CMDA Major, minor in Mathematics, Senior)
Eshan Kaul (Computer Science & Economics Double Major, minor in Mathematics, Senior)
Arya Shah (Computer Science Major, Senior)
Rohith Mahesh (Computer Science Major, Senior)
Anurag Kulkarni (Business Information Technology, minor in Computer Science, Junior)

Motivation

Research Question

Can we predict Indian solar power production in order to potentially achieve pay-as-you-go pricing?

Data Introduction

All the data used in this notebook was retrieved from Kaggle as comma-separated values (CSV) files. The data was gathered at two solar power plants in India over a 34-day period. The source contains four data sets, of which we analyzed two: a plant generation file and a plant weather-sensor file. The generation data is recorded at the inverter level, where each inverter has multiple strings of solar panels attached to it (68,778 rows × 7 columns). The sensor data is recorded at the plant level by a single, optimally placed array of sensors (3,182 rows × 6 columns).

Read In Data + EDA

Correlation

Definition: Correlation is any statistical relationship, whether causal or not, between two random variables or bivariate data. The Pearson correlation coefficient is obtained by dividing the covariance (the joint variability of the two random variables) by the product of their standard deviations, which normalizes it to lie in $[-1, 1]$.

$$ \rho_{X,Y} = \text{corr}(X,Y) = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

$$ \text{cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

For further details see: https://en.wikipedia.org/wiki/Correlation
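The definition above can be checked numerically. A minimal sketch with NumPy on synthetic data (the data and the factor of 2 are illustrative, not from the plant data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)  # y is strongly correlated with x

# Pearson correlation from the definition: cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

# Agrees with NumPy's built-in estimate
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
```

The theoretical value here is $2/\sqrt{5} \approx 0.89$, and the sample estimate lands close to it.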

Exploratory Data Analysis Cntd.

[Figure: Pearson.png]

[Figure: Rplot1.png]

[Figure: Rplot2.png]

EDA Discussion

Correlation Analysis & Pairs Plot

The AC_Power and DC_Power variables are strongly linearly correlated. Using a pairs plot, we find little multicollinearity among the remaining variables, which makes our set suitable for multiple machine learning techniques.

Density Plots

From the density plots, we can observe the general distribution of our most significant factors. The AC and DC density plots are nearly identical, consistent with the strong correlation between the two variables. Total yield sits on a much larger scale because it is a running sum of the daily yield.

Summary Statistics

Basic summary statistics help us understand the nature of the data.

Solution Approach

In this notebook, we explore a machine learning technique called a voting regression ensemble. A voting ensemble combines several different machine learning models and averages their results to produce a single, more robust prediction.

(In classification, "soft voting" averages the predicted class probabilities of all the estimators; the regression analogue simply averages the predicted values.)

Within our voting ensemble, we averaged the predictions of a Random Forest, Linear Regression, Orthogonal Matching Pursuit, and a Gradient Boosting Regressor to find the ideal voting regressor for our data.
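A minimal sketch of such a voting regressor with scikit-learn, using synthetic data as a stand-in for the plant data (the hyperparameters here are illustrative, not the ones behind our reported results):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.linear_model import LinearRegression, OrthogonalMatchingPursuit
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the plant features (e.g. irradiation, temperatures)
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The voting regressor averages the predictions of its base estimators
ensemble = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("lr", LinearRegression()),
    ("omp", OrthogonalMatchingPursuit(n_nonzero_coefs=4)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
ensemble.fit(X_train, y_train)
score = ensemble.score(X_test, y_test)  # R^2 on held-out data
```

Each base estimator is fit independently; the ensemble's prediction for a sample is the unweighted mean of the four individual predictions.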

Ordinary least squares (OLS) is a strong modeling tool that is often used in econometrics for forecasting future markets.

The purpose of OLS is to take a theoretical equation: $Y_{i} = {\beta}_{0} + {\beta}_{1}X_{1i} + {\beta}_{2}X_{2i} + {\beta}_{3}X_{3i} + {\beta}_{4}X_{4i} + {\epsilon}_{i}$ (1.1)
and produce the estimated equation: $\hat{Y_{i}} = \hat{\beta}_{0} + \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i} + \hat{\beta}_{3}X_{3i} + \hat{\beta}_{4}X_{4i}$ (1.2)

This is achieved by minimizing the sum of the squared residuals.

$$SS_{res}=\sum{e_i}^2=\sum\limits_{i=1}^n{(y_i-\hat{y_i})^2}=\sum\limits_{i=1}^n(y_i-(\hat{\beta_0}+\hat{\beta_1}x_i))^2$$


We call the above quantity the sum of squares for the residuals ($SS_{res}$), shown here for the simple one-predictor case. Our best estimated line, then, is the one which minimizes $SS_{res}$.

Minimizing $SS_{res}$ with respect to the coefficient estimates yields the normal equations
$$\frac{\partial SS_{res}}{\partial \hat{\beta_0}}=-2\sum\limits_{i=1}^n[y_i-\hat{\beta_0}-\hat{\beta_1}x_i]=0$$
$$\frac{\partial SS_{res}}{\partial \hat{\beta_1}}=-2\sum\limits_{i=1}^n[x_iy_i-\hat{\beta_0}x_i-\hat{\beta_1}x_i^2]=0$$

$$n\hat{\beta_0}+\hat{\beta_1}\sum\limits_{i=1}^n{x_i}=\sum\limits_{i=1}^n{y_i}$$

$$\hat{\beta_0}\sum\limits_{i=1}^n{x_i}+\hat{\beta_1}\sum\limits_{i=1}^n{x_i^2}=\sum\limits_{i=1}^n{x_iy_i}$$
Let $$\bar{x}=\frac{1}{n}\sum\limits_{i=1}^n{x_i}$$ and $$\bar{y}=\frac{1}{n}\sum\limits_{i=1}^n{y_i}$$ be the sample means of the predictor values and the responses, and define
$$S_{xy}=\sum\limits_{i=1}^n{(x_i-\bar{x})(y_i-\bar{y})}=\sum\limits_{i=1}^ny_i(x_i-\bar{x})=\sum\limits_{i=1}^nx_iy_i-n\bar{x}\bar{y}$$
$$S_{xx} = \sum\limits_{i=1}^n{(x_i-\bar{x})}^2=\sum\limits_{i=1}^n{x_i^2}-n\bar{x}^2$$
$$S_{yy} = \sum\limits_{i=1}^n{(y_i-\bar{y})}^2=\sum\limits_{i=1}^n{y_i^2}-n\bar{y}^2$$
Here $S_{xx}$ is the sum of squared differences between each $x_i$ and the mean $\bar{x}$, and $S_{xy}$ is the sum of products of the deviations of $x$ and $y$ from their respective means. Then the values of $\hat{\beta_0}$ and $\hat{\beta_1}$ minimizing $SS_{res}$, or equivalently solving the normal equations, are
$$\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}$$
$$\hat{\beta_1}=\frac{S_{xy}}{S_{xx}}$$
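These closed-form estimates can be verified numerically. A small sketch on synthetic data with known coefficients (the true values 3 and 2 are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=200)  # true beta0=3, beta1=2

x_bar, y_bar = x.mean(), y.mean()
S_xy = np.sum((x - x_bar) * (y - y_bar))
S_xx = np.sum((x - x_bar) ** 2)

# Closed-form OLS estimates from the derivation above
beta1_hat = S_xy / S_xx
beta0_hat = y_bar - beta1_hat * x_bar
```

The estimates land near the true coefficients and match `np.polyfit(x, y, 1)` exactly, since both solve the same normal equations.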

Random Forest

Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees. In a random forest, each tree is built from the training set with the Bootstrap Aggregation, or bagging, technique. Bootstrapping refers to random sampling with replacement. Bagging is a general procedure for reducing the variance of high-variance algorithms, typically decision trees. Additionally, when splitting a node during the construction of a tree, the chosen split is no longer the best split among all features; instead, it is the best split among a random subset of the features. This introduces additional randomness into the model.

The random forest regressor is used for regression tasks, similar to a random forest classifier. The main difference is that instead of outputting a class, it outputs a continuous value.

During the training process, the algorithm builds multiple decision trees, each of which is trained on a different subset of the data. Each tree makes a prediction, and the final output of the random forest regressor is the average of all of the trees' predictions.

In addition to averaging the predictions, the use of multiple decision trees allows for the model to capture a wider range of relationships in the data. The random subsampling of features when splitting a node also helps to decorrelate the trees, which reduces overfitting and improves the overall performance of the model.

Let $X = x_1, ..., x_n$ be a training set with responses $Y = y_1, ..., y_n$. Drawing $B$ bootstrap samples (random samples selected with replacement) yields tree classifiers $h_1, ..., h_B$.

Then, combining the classifiers by majority vote (for a binary classification task), we have:

$$ h(x) = \left\{ \begin{array}{ll} 1 & \text{if } \frac{1}{B}\sum_{j}{h_{j}(x)} \ge \frac{1}{2} \\ 0 & \text{otherwise} \\ \end{array} \right. $$

For regression, the forest prediction is the average of the individual tree estimators:

$$\hat{m}(x) = \frac{1}{M}\sum_{j}{\hat{m}_{j}(x)}$$

where $\hat{m}_{j}$ is the tree estimator fit on the $j$-th bootstrapped sample using $p$ randomly selected features, and $M$ is the number of trees. Averaging each tree's prediction error over the samples left out of its bootstrap draw yields the out-of-bag error estimate.
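A brief sketch of a random forest regressor, including its out-of-bag error estimate, in scikit-learn (synthetic data; the hyperparameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=0)

# oob_score=True scores each tree on the samples left out of its bootstrap
# draw -- the out-of-bag estimate described above
forest = RandomForestRegressor(
    n_estimators=200,
    max_features="sqrt",   # random subset of features at each split
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)
oob_r2 = forest.oob_score_   # out-of-bag R^2 estimate
```

Because each sample is out-of-bag for roughly a third of the trees, the OOB score gives a built-in generalization estimate without a separate validation split.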

K Nearest Neighbor

Gradient Boosting

Gradient Boosting is an ensemble machine learning algorithm that can be used for both classification and regression tasks. It is a type of boosting algorithm, which works by combining several weak models (decision trees) to create a strong model.

The goal of gradient boosting is to train a sequence of decision trees, where each tree tries to correct the mistakes made by the previous tree. The algorithm starts by fitting a simple decision tree to the data, and then uses the residuals (the difference between the predicted values and the true values) as the target for the next tree. This process is repeated multiple times, and the final prediction is obtained by combining the predictions of all the trees.

Mathematically, we seek the nonlinear predictor $\hat{h}(x) \in \mathcal{H}$ that minimizes the loss: $$\hat{h} =\underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\;{\mathcal{L}(h(X), Y)}$$

The main advantage of Gradient Boosting is that it can capture non-linear relationships between the input variables and the output variable; with a suitable loss function it is also relatively robust to outliers, and some implementations can handle missing values natively.

One of the key components behind Gradient Boosting is the concept of gradient descent, which is used to minimize the loss function. In Gradient Boosting, the loss function is typically the mean squared error (MSE) or the mean absolute error (MAE) for regression problems and the cross-entropy loss for classification problems.

The square loss (regression) function can be represented mathematically as: $$\mathcal{L}(h(X), Y) = \sum_{i=1}^{n} (h(x_i) - y_i)^2$$
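The residual-fitting loop described above can be sketched by hand with shallow trees under the square loss, where each tree fits the negative gradient of the loss, i.e. the current residuals (synthetic data; the learning rate and tree depth are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

# Hand-rolled gradient boosting with square loss: each shallow tree fits the
# residuals of the model built so far, and its (damped) prediction is added on
learning_rate = 0.1
pred = np.full_like(y, y.mean())   # start from the mean prediction
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)
    pred += learning_rate * tree.predict(X)

mse = np.mean((y - pred) ** 2)  # training error shrinks as trees are added
```

Library implementations such as scikit-learn's `GradientBoostingRegressor` follow this same scheme, with additional refinements like subsampling and early stopping.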

Partial Least Squares Canonical

PLS Canonical is a type of multivariate statistical analysis that is useful for analyzing the relationship between a large number of X variables and a small number of Y variables. The goal of the analysis is to find the linear combinations of the X variables that explain the most variance in the Y variables. This is done by finding linear combinations of the X variables that are highly correlated with the Y variables.

Procedure: The PLS Canonical method starts by defining a set of latent components, or latent factors, that are linear combinations of the X variables. These latent variables are then used to model the Y variables. The method finds the linear combination of the X variables that maximizes the covariance between the latent variables and the Y variables.

Given two centered matrices $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times t}$, and a number of components $K$, the PLSCanonical procedure is:

Set $X_1$ to $X$ and $Y_1$ to $Y$. For each $k \in [1, K]$:

Compute $u_k \in \mathbb{R}^d$ and $v_k \in \mathbb{R}^t$, the first left and right singular vectors of the cross-covariance matrix $C = X_k^T Y_k$, by computing the SVD of $C$ and keeping the singular vectors corresponding to the largest singular value. $u_k$ and $v_k$ are weights chosen to maximize the covariance between the projected $X_k$ and the projected target: $\text{Cov}(X_k u_k, Y_k v_k)$.

Project $X_k$ and $Y_k$ on the singular vectors to obtain the scores: $\xi_k = X_k u_k$ and $\omega_k = Y_k v_k$

Then regress $X_k$ on $\xi_k$ to find a loading vector $\gamma_k \in \mathbb{R}^d$ such that the rank-1 matrix $\xi_k \gamma_k^T$ minimizes the distance to $X_k$.

Regressing $Y_k$ on $\omega_k$ similarly yields the loading vector $\delta_k \in \mathbb{R}^t$.

Finally, deflate $X_k$ and $Y_k$ by subtracting the rank-1 approximations: $X_{k+1} = X_k - \xi_k \gamma_k^T$ and $Y_{k + 1} = Y_k - \omega_k \delta_k^T$.

The resulting model approximates $X$ as a sum of rank-1 matrices, $X = \Xi \Gamma^T$, where the columns of $\Xi \in \mathbb{R}^{n \times K}$ contain the scores and the rows of $\Gamma^T \in \mathbb{R}^{K \times d}$ contain the loading vectors; similarly, $Y = \Omega \Delta^T$, where $\Xi$ and $\Omega$ are the projections of the training data $X$ and $Y$.

Orthogonal Matching Pursuit

Interpretations of Results

Our voting ensemble combines the results of Linear Regression (model accuracy: 98.16%), Random Forests (model accuracy: 99.0565%), Gradient Boosting (model accuracy: 98.4304%), and Orthogonal Matching Pursuit (model accuracy: 98.1363%). Averaging these predictions gives an overall ensemble model accuracy of 98.7468%. Voting regression guards against overfitting by any individual machine learning model: rather than relying on a single learner, we optimize the prediction by averaging the results of all of them.

Limitations

Potential Ethical Concerns

Citations:

https://ourworldindata.org/renewable-energy

https://www.kaggle.com/datasets/anikannal/solar-power-generation-data?resource=download

https://www.nachi.org/advantages-solar-energy.htm

https://www.energy.gov/energysaver/benefits-residential-solar-electricity

https://www.investopedia.com/articles/investing/053015/pros-and-cons-solar-energy.asp

https://machinelearningmastery.com/voting-ensembles-with-python/

Other Methods Explored ⬇

https://scikit-learn.org/stable/modules/ensemble.html

https://www.kaggle.com/code/bpncool/data-exploration-visualize-hourly-variation

https://www.kaggle.com/code/lumierebatalong/solar-power-machine-learning-i

https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html#sphx-glr-auto-examples-ensemble-plot-stack-predictors-py